The goal for this project is to identify whether intrastate travel has any affect on the increase of COVID rates per state. We will clean and merge two datasets, one containing the daily COVID rates per state and the other summarizing intrastate travel per state. We intend to focus on daily positive increases as a function of intrastate travel, and will explore the data using both visual and predictive analysis. In despite of analyzing the data from multiple angles, our models predict that there is no direct correlation between the spread of COVID-19 and the number of trips within a state.
Note: Please download the HTML notebook file (or run this file) on your local machine to see the visualizations properly.
"Can the new daily COVID-19 cases per state be explained by the number and distance of travel in the US?"
Between August 2nd and September 5th 2020, the CDC reported that the weekly COVID-19 cases within individuals aged 18 and 22 increased to 55% across the nation [1]. This has resulted in warnings issued to Americans from international travel, pointing to the enhancement of safety and health conditions in certain states (cdc.gov). The CDC has continually informed the public on how traveling potentially increases the possibility of being infected and spreading the coronavirus and suggests that staying at home is the best preventive measure. However, many individuals have not yet taken such measures of spreading the virus seriously and still travel around [2]. Therefore, to adequately create public awareness and offer scientific analysis on the necessity of keeping of maintaining social distance, we demonstrate the correlation between the spread of COVID-19 and traveling mathematically through data algorithms [3].
Various studies have been carried out to offer the algorithms for different viruses’ widespread presence techniques or predict the disease's spread rate. For example, WU and Leung [4] established the framework used to indicate the possible local and international coronavirus spread.
In this detailed data analysis, the correlation between traveling between cities with new confirmed cases of COVID-19 in United States is demonstrated. The impact of air and urban traffic involving the passenger population and the number of flights, the new COVID-19 cases have been probed. This data's significance is that irrespective of the deficiency of comprehensive information concerning the data figure of travels among Americans, a relationship between the accumulative number of trips and the spread of COVID-19 cases can be established.
References (include links):
Our primary hypothesis is that we predict that the number of new daily COVID-19 cases is positively correlated with the number travels and hope to illustrate that less travel results in less cases. Along with the number of travels, we also predict that the distance of travel is positively correlated with the number of new daily COVID-19 cases, though weaker than to the total number of travels in that state.
We believe in these such ways as COVID-19 is a respiratory disease and it is known to be transmitted mainly through respiratory droplets produced when an infected person coughs, sneezes, or talks. These droplets can land in the mouths or noses of people who are nearby or possibly be inhaled into the lungs. With that in mind, we hypotheisze that more travels and longer trips lead to a higher chance of unintentional COVID-19 spread.
Our first dataset is "Trips by Distance" from the US Department of Transportation, Bureau of Transportation Statistics. The dataset contains a number of trips separated by distance (1-3, 3-5, 5-10, etc) and by county/state, starting on January 1st, 2019. But, since COVID-19 dataset starts from 1/22/2020, we will use only a partial date dataset (1/22/2020 - 10/10/2020).
Trips are defined as movements that include a stay of longer than 10minutes at an anonymized local away from home. Trips capture travel by all modes of transportation including driving, rail, and air. This dataset tracks mobile data geolocation as an indication for movement. It doesn't specify between modes of transportation, however we consider this more significant as it captures motion otherwise uncollectable.
Our download of this dataset is October 10th, 2020.
https://data.bts.gov/Research-and-Statistics/Trips-by-Distance/w96p-f2qv
Our second data comes from Kaggle, us_states_covid19_daily.csv. This dataset shows the number of daily COVID-19 cases per day by US state.
https://www.kaggle.com/sudalairajkumar/covid19-in-usa?select=us_covid19_daily.csv
Our download of both datasets is October 10th, 2020.
#importing standard libraries
import pandas as pd
import numpy as np
import requests
from io import StringIO
#basic visualization package
import matplotlib.pyplot as plt
#%matplotlib inline
from datetime import datetime
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
#Interactive Visualization
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# for offline ploting
from plotly.offline import plot, iplot, init_notebook_mode
init_notebook_mode(connected=True)
# For predictive analysis
import patsy
import statsmodels.api as sm
import scipy.stats as stats
from scipy.stats import ttest_ind, chisquare, normaltest
First we will focus on cleaning the Trips by distance dataset. While it contains multiple levels (county, state, national), we are only interested in the state level. Then we will merge it with the Daily COVID dataset and merge them together. Lastly, we will break the datasets up per state for easier manipulation and to minimize one-hot encoding later.
#import datasets
df_trips_by_distance = pd.read_csv("https://www.dropbox.com/s/970urh3ye9g728h/Trips_by_Distance.csv?dl=1")
df_COVID19_state = pd.read_csv("https://www.dropbox.com/s/6rv32rseh84k1pi/us_states_covid19_daily.csv?dl=1")
The first case of COVID-19 in USA was on 1/20/2020, but since our other dataset starts from 1/22/2020, we will use that date instead.
#new dataframe called df_trips_by_distance_state from df_trips_by_distance, only looking at state level travels
df_trips_by_distance_state = df_trips_by_distance.loc[(df_trips_by_distance['Level'] == "State") & (df_trips_by_distance['Date'] >= "2020/01/22")]
df_trips_by_distance_state = df_trips_by_distance_state.drop(["Level", "State FIPS", "County FIPS", "County Name"], 1)
#convert the date column to integer to match the COVID dataset
df_trips_by_distance_state["Date"] = pd.to_datetime(df_trips_by_distance_state["Date"]).dt.strftime("%Y%m%d").astype(int)
#reset index
df_trips_by_distance_state.reset_index(inplace = True)
#rename columns
df_trips_by_distance_state.rename(columns = {"State Postal Code": "State"}, inplace = True)
df_trips_by_distance_state.rename(columns = {"Number of Trips": "nTrips"}, inplace = True)
We decided that combining the 10 "number of trips" columns into 3 groups will be more efficient for our analysis. The first group summons the number of trips ranging from less than 1 mile to 25 miles (inclusive). The second group summons the number of trips ranging from 25 miles (exclusive) to 250 miles, and the last group from 250 miles (exclusive) to 500 miles.
#combine number of trips into groups (1-25, 25-250, 250-500)
#group1
df_trips_by_distance_state["Number_of_Trips_1_25"] = df_trips_by_distance_state["Number of Trips <1"] + df_trips_by_distance_state["Number of Trips 1-3"] + df_trips_by_distance_state["Number of Trips 3-5"] + df_trips_by_distance_state["Number of Trips 5-10"]
+ df_trips_by_distance_state["Number of Trips 10-25"]
#group2
df_trips_by_distance_state["Number_of_Trips_25_250"] = df_trips_by_distance_state["Number of Trips 50-100"] + df_trips_by_distance_state["Number of Trips 100-250"]
#group3
df_trips_by_distance_state["Number_of_Trips_250_"] = df_trips_by_distance_state["Number of Trips 250-500"] + df_trips_by_distance_state["Number of Trips >=500"]
df_trips_by_distance_state
| index | Date | State | Population Staying at Home | Population Not Staying at Home | nTrips | Number of Trips <1 | Number of Trips 1-3 | Number of Trips 3-5 | Number of Trips 5-10 | Number of Trips 10-25 | Number of Trips 25-50 | Number of Trips 50-100 | Number of Trips 100-250 | Number of Trips 250-500 | Number of Trips >=500 | Number_of_Trips_1_25 | Number_of_Trips_25_250 | Number_of_Trips_250_ | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1245449 | 20200122 | AL | 807735.0 | 4080136.0 | 16164304.0 | 3480408.0 | 3874504.0 | 2156843.0 | 2708360.0 | 2596986.0 | 902492.0 | 309079.0 | 107087.0 | 23934.0 | 4611.0 | 12220115.0 | 416166.0 | 28545.0 |
| 1 | 1245450 | 20200122 | AK | 159711.0 | 577727.0 | 2949908.0 | 874686.0 | 800754.0 | 403581.0 | 411898.0 | 308836.0 | 111693.0 | 25079.0 | 8213.0 | 1378.0 | 3790.0 | 2490919.0 | 33292.0 | 5168.0 |
| 2 | 1245451 | 20200122 | AZ | 1352937.0 | 5818709.0 | 25236411.0 | 6687167.0 | 6068252.0 | 3031583.0 | 3790445.0 | 3989392.0 | 1171535.0 | 305405.0 | 147048.0 | 27142.0 | 18442.0 | 19577447.0 | 452453.0 | 45584.0 |
| 3 | 1245452 | 20200122 | AR | 544871.0 | 2468954.0 | 9272794.0 | 2130987.0 | 2415655.0 | 1184837.0 | 1429620.0 | 1343605.0 | 490685.0 | 181059.0 | 75534.0 | 17561.0 | 3251.0 | 7161099.0 | 256593.0 | 20812.0 |
| 4 | 1245453 | 20200122 | CA | 7227607.0 | 32329438.0 | 145519407.0 | 40323676.0 | 36885229.0 | 16871766.0 | 20809450.0 | 20492587.0 | 7281951.0 | 2148748.0 | 533057.0 | 114312.0 | 58631.0 | 114890121.0 | 2681805.0 | 172943.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13408 | 2072894 | 20201010 | DE | 253912.0 | 713259.0 | 2754854.0 | 591440.0 | 670914.0 | 362842.0 | 457957.0 | 410734.0 | 170809.0 | 71527.0 | 17162.0 | 1034.0 | 435.0 | 2083153.0 | 88689.0 | 1469.0 |
| 13409 | 2072895 | 20201010 | ME | 372027.0 | 966377.0 | 3517685.0 | 639075.0 | 825972.0 | 433764.0 | 568335.0 | 647161.0 | 258947.0 | 108178.0 | 31313.0 | 3439.0 | 1501.0 | 2467146.0 | 139491.0 | 4940.0 |
| 13410 | 2072896 | 20201010 | OK | 1056873.0 | 2886206.0 | 10486457.0 | 2263637.0 | 2643666.0 | 1252829.0 | 1594162.0 | 1686304.0 | 611996.0 | 304401.0 | 112079.0 | 15137.0 | 2246.0 | 7754294.0 | 416480.0 | 17383.0 |
| 13411 | 2072897 | 20201010 | WI | 1534312.0 | 4279256.0 | 16821730.0 | 3598155.0 | 4026875.0 | 2056805.0 | 2605089.0 | 2695790.0 | 1111973.0 | 508067.0 | 190840.0 | 23237.0 | 4899.0 | 12286924.0 | 698907.0 | 28136.0 |
| 13412 | 2072898 | 20201010 | WV | 501681.0 | 1304151.0 | 4555131.0 | 851640.0 | 1122425.0 | 552794.0 | 751122.0 | 729302.0 | 323742.0 | 163886.0 | 52794.0 | 6459.0 | 967.0 | 3277981.0 | 216680.0 | 7426.0 |
13413 rows × 19 columns
# Check for Null values
df_trips_by_distance_state.isnull().sum()
index 0 Date 0 State 0 Population Staying at Home 0 Population Not Staying at Home 0 nTrips 0 Number of Trips <1 0 Number of Trips 1-3 0 Number of Trips 3-5 0 Number of Trips 5-10 0 Number of Trips 10-25 0 Number of Trips 25-50 0 Number of Trips 50-100 0 Number of Trips 100-250 0 Number of Trips 250-500 0 Number of Trips >=500 0 Number_of_Trips_1_25 0 Number_of_Trips_25_250 0 Number_of_Trips_250_ 0 dtype: int64
Next, we want to prepare the COVID-19 dataset.
#The top row is the latest data, so we need to shift the index and reset it such that Jan 22, 2020 is the new first row
df_COVID19_state = df_COVID19_state.iloc[::-1]
df_COVID19_state.reset_index(inplace = True)
#Extract variables of interest
df_COVID19_state_wrangled = df_COVID19_state[["date", "state", "positive", "positiveIncrease"]]
#view df_COVID19_state_wrangled
df_COVID19_state_wrangled
| date | state | positive | positiveIncrease | |
|---|---|---|---|---|
| 0 | 20200122 | WA | 0.0 | 0 |
| 1 | 20200122 | MA | NaN | 0 |
| 2 | 20200123 | WA | 0.0 | 0 |
| 3 | 20200123 | MA | NaN | 0 |
| 4 | 20200124 | WA | 0.0 | 0 |
| ... | ... | ... | ... | ... |
| 11629 | 20200927 | AZ | 217237.0 | 411 |
| 11630 | 20200927 | AS | 0.0 | 0 |
| 11631 | 20200927 | AR | 81242.0 | 487 |
| 11632 | 20200927 | AL | 152321.0 | 730 |
| 11633 | 20200927 | AK | 8431.0 | 116 |
11634 rows × 4 columns
# Check for Null values
df_COVID19_state_wrangled.isnull().sum()
date 0 state 0 positive 99 positiveIncrease 0 dtype: int64
# After manual exploration, these null values occur when there are zero positive cases so we will replace them with zeroes
df_COVID19_state_wrangled['positive'] = df_COVID19_state_wrangled['positive'].fillna(0)
C:\Users\PC\miniconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
#merge the two datasets
df_combined = pd.merge(left = df_trips_by_distance_state, right=df_COVID19_state_wrangled, how='left', left_on=["Date","State"], right_on=["date","state"])
#Create region mapping for four regions
regions = {'AK': 'W','AL': 'S', 'AR': 'S', 'AZ': 'W', 'CA': 'W',
'CO': 'W', 'CT': 'N', 'DC': 'N', 'DE': 'N', 'FL': 'S', 'GA': 'S', 'HI': 'W', 'IA': 'M',
'ID': 'W', 'IL': 'M', 'IN': 'M', 'KS': 'M', 'KY': 'S', 'LA': 'S', 'MA': 'N', 'MD': 'N',
'ME': 'N', 'MI': 'W', 'MN': 'M', 'MO': 'M', 'MS': 'S', 'MT': 'W', 'NC': 'S', 'ND': 'M',
'NE': 'W', 'NH': 'N', 'NJ': 'N', 'NM': 'W', 'NV': 'W', 'NY': 'N', 'OH': 'M', 'OK': 'S',
'OR': 'W', 'PA': 'N', 'RI': 'N', 'SC': 'S', 'SD': 'M', 'TN': 'S', 'TX': 'S', 'UT': 'W',
'VA': 'S', 'VT': 'N', 'WA': 'W', 'WI': 'M', 'WV': 'S', 'WY': 'W' }
df_combined['region'] = df_combined['State'].map(regions)
#drop repeated columns
df_combined = df_combined.drop(["date", "state"], 1)
#Get ratio of population not staying home
df_combined["Ratio"] = df_combined["Population Not Staying at Home"]/ (df_combined["Population Staying at Home"]+df_combined["Population Not Staying at Home"])
df_combined
| index | Date | State | Population Staying at Home | Population Not Staying at Home | nTrips | Number of Trips <1 | Number of Trips 1-3 | Number of Trips 3-5 | Number of Trips 5-10 | ... | Number of Trips 100-250 | Number of Trips 250-500 | Number of Trips >=500 | Number_of_Trips_1_25 | Number_of_Trips_25_250 | Number_of_Trips_250_ | positive | positiveIncrease | region | Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1245449 | 20200122 | AL | 807735.0 | 4080136.0 | 16164304.0 | 3480408.0 | 3874504.0 | 2156843.0 | 2708360.0 | ... | 107087.0 | 23934.0 | 4611.0 | 12220115.0 | 416166.0 | 28545.0 | NaN | NaN | S | 0.834747 |
| 1 | 1245450 | 20200122 | AK | 159711.0 | 577727.0 | 2949908.0 | 874686.0 | 800754.0 | 403581.0 | 411898.0 | ... | 8213.0 | 1378.0 | 3790.0 | 2490919.0 | 33292.0 | 5168.0 | NaN | NaN | W | 0.783425 |
| 2 | 1245451 | 20200122 | AZ | 1352937.0 | 5818709.0 | 25236411.0 | 6687167.0 | 6068252.0 | 3031583.0 | 3790445.0 | ... | 147048.0 | 27142.0 | 18442.0 | 19577447.0 | 452453.0 | 45584.0 | NaN | NaN | W | 0.811349 |
| 3 | 1245452 | 20200122 | AR | 544871.0 | 2468954.0 | 9272794.0 | 2130987.0 | 2415655.0 | 1184837.0 | 1429620.0 | ... | 75534.0 | 17561.0 | 3251.0 | 7161099.0 | 256593.0 | 20812.0 | NaN | NaN | S | 0.819209 |
| 4 | 1245453 | 20200122 | CA | 7227607.0 | 32329438.0 | 145519407.0 | 40323676.0 | 36885229.0 | 16871766.0 | 20809450.0 | ... | 533057.0 | 114312.0 | 58631.0 | 114890121.0 | 2681805.0 | 172943.0 | NaN | NaN | W | 0.817286 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13408 | 2072894 | 20201010 | DE | 253912.0 | 713259.0 | 2754854.0 | 591440.0 | 670914.0 | 362842.0 | 457957.0 | ... | 17162.0 | 1034.0 | 435.0 | 2083153.0 | 88689.0 | 1469.0 | NaN | NaN | N | 0.737469 |
| 13409 | 2072895 | 20201010 | ME | 372027.0 | 966377.0 | 3517685.0 | 639075.0 | 825972.0 | 433764.0 | 568335.0 | ... | 31313.0 | 3439.0 | 1501.0 | 2467146.0 | 139491.0 | 4940.0 | NaN | NaN | N | 0.722037 |
| 13410 | 2072896 | 20201010 | OK | 1056873.0 | 2886206.0 | 10486457.0 | 2263637.0 | 2643666.0 | 1252829.0 | 1594162.0 | ... | 112079.0 | 15137.0 | 2246.0 | 7754294.0 | 416480.0 | 17383.0 | NaN | NaN | S | 0.731968 |
| 13411 | 2072897 | 20201010 | WI | 1534312.0 | 4279256.0 | 16821730.0 | 3598155.0 | 4026875.0 | 2056805.0 | 2605089.0 | ... | 190840.0 | 23237.0 | 4899.0 | 12286924.0 | 698907.0 | 28136.0 | NaN | NaN | M | 0.736081 |
| 13412 | 2072898 | 20201010 | WV | 501681.0 | 1304151.0 | 4555131.0 | 851640.0 | 1122425.0 | 552794.0 | 751122.0 | ... | 52794.0 | 6459.0 | 967.0 | 3277981.0 | 216680.0 | 7426.0 | NaN | NaN | S | 0.722188 |
13413 rows × 23 columns
#Again, we have null values for when the Trip by Distance dataset has more observations than the COVID19 dataset
#Since we confirmed there were no unexpected null values before the merge, we can replace the values with 0
df_combined = df_combined.fillna(0)
#save combined dataset for future use
df_combined.to_csv('combined.csv')
#Next we will break up the dataset per state
state_list = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
"HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
"MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
"NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
"SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
for state in state_list:
#create new dataset for each state
exec(f'df_combined_{state} = df_combined.loc[(df_combined["State"] == state)]')
#Convert combined date from integer to datetime for plotting
exec(f'df_combined_{state}["Date"] = pd.to_datetime(df_combined_{state}["Date"], format="%Y%m%d")')
#Convert combined to datetime since we no longer need to worry about the merge
df_combined['Date'] = pd.to_datetime(df_combined['Date'], format='%Y%m%d')
C:\Users\PC\miniconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Both datasets were standardized in their values and date formatting. We had an easy time merging and extracting only what we were interrested in for analysis.
We didn't have to do any pre-processing to our datasets as both csv files were collected by their respective source.
From the Trips By Distance dataset, we dropped: "Level", "State FIPS", "County FIPS", and "County Name".
From the Daily COVID 19 dataset, we dropped: 'index', 'negative', 'pending', 'totalTestResults', 'hospitalizedCurrently', 'hospitalizedCumulative', 'inIcuCurrently', 'inIcuCumulative', 'onVentilatorCurrently', 'onVentilatorCumulative', 'recovered', 'dataQualityGrade', 'lastUpdateEt', 'dateModified', 'checkTimeEt', 'death', 'hospitalized', 'dateChecked', 'totalTestsViral', 'positiveTestsViral', 'negativeTestsViral', 'positiveCasesViral', 'deathConfirmed', 'deathProbable', 'totalTestEncountersViral', 'totalTestsPeopleViral', 'totalTestsAntibody', 'positiveTestsAntibody', 'negativeTestsAntibody', 'totalTestsPeopleAntibody', 'positiveTestsPeopleAntibody', 'negativeTestsPeopleAntibody', 'totalTestsPeopleAntigen', 'positiveTestsPeopleAntigen', 'totalTestsAntigen', 'positiveTestsAntigen', 'fips', 'negativeIncrease', 'total', 'totalTestResultsSource', 'totalTestResultsIncrease', 'posNeg', 'deathIncrease', 'hospitalizedIncrease', 'hash', 'commercialScore', 'negativeRegularScore', 'negativeScore', 'positiveScore', 'score', and 'grade'
We merged several of the Trips By Distance dataset into four groupings for analysis based on distance range.
We added state regions which will later help with plotting. We also added ratio, which calculates proportion of population which is not staying at home.
We didn't drop any rows due to null values. Instead, we converted them to zeroes as they only occured in states before any COVID19 cases were reported.
From the Trips by Distance dataset, we dropped 2059493 rows as they were either at the county or national level since we only cared about the state level and interstate travel. We note there will be some loss in granularity of our data, but our Trips by Distance dataset doesn't describe the county level so the additional data is irrelevant.
#Get total combined shape
print("Number of observations: ", df_combined.shape[0])
print("Number of features: ", df_combined.shape[1])
#Get date range that we are observing
print("Dates observed: ", min(df_combined['Date']), " to ", max(df_combined['Date']))
#Double check null observations
print("Total null observations: ", df_combined.isnull().sum().sum())
#Check which states we are analyzing
print("Number of states: ", len(df_combined['State'].unique()))
print("States we are analyzing: ", df_combined['State'].unique())
#Check number of trips by distance
print('Number of trips < 25 miles:', df_combined['Number_of_Trips_1_25'].sum())
print('Number of trips > 25 miles:', (df_combined['Number_of_Trips_25_250'] + df_combined['Number_of_Trips_250_']).sum())
Number of observations: 13413 Number of features: 23 Dates observed: 2020-01-22 00:00:00 to 2020-10-10 00:00:00 Total null observations: 0 Number of states: 51 States we are analyzing: ['AL' 'AK' 'AZ' 'AR' 'CA' 'CO' 'CT' 'DE' 'DC' 'FL' 'GA' 'HI' 'ID' 'IL' 'IN' 'IA' 'KS' 'KY' 'LA' 'ME' 'MD' 'MA' 'MI' 'MN' 'MS' 'MO' 'MT' 'NE' 'NV' 'NH' 'NJ' 'NM' 'NY' 'NC' 'ND' 'OH' 'OK' 'OR' 'PA' 'RI' 'SC' 'SD' 'TN' 'TX' 'UT' 'VT' 'VA' 'WA' 'WV' 'WI' 'WY'] Number of trips < 25 miles: 205233058561.0 Number of trips > 25 miles: 8040487765.0
In summation, we have a large dataset analyzing trips and daily COVID cases in 50 states + Washington DC between Jan 22 and October 10, 2020.
First, we will look at a heatmap that shows correlation between the distance traveled and positive increase variables.
df_combined_heatmap = df_combined[["Population Staying at Home", "Population Not Staying at Home", "Number_of_Trips_1_25", "Number_of_Trips_25_250", "Number_of_Trips_250_", "positiveIncrease", "positive"]]
#Shows the correlation
plt.subplots(figsize=(10, 10))
plt.title("Heatmap of trips and COVID-19 new cases data")
heatmap = sns.heatmap(df_combined_heatmap.corr(), annot=True)
For each state, we will drop any row that does not have a positive increase > 0. This makes no mathematical sense as a state cannot have negative cases of positive cases of COVID-19. We believe that these were data entry errors, so we'll omit them for our analysis.
df_combined[df_combined['positiveIncrease'] < 0]
| index | Date | State | Population Staying at Home | Population Not Staying at Home | nTrips | Number of Trips <1 | Number of Trips 1-3 | Number of Trips 3-5 | Number of Trips 5-10 | ... | Number of Trips 100-250 | Number of Trips 250-500 | Number of Trips >=500 | Number_of_Trips_1_25 | Number_of_Trips_25_250 | Number_of_Trips_250_ | positive | positiveIncrease | region | Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5330 | 1580794 | 2020-05-05 | MT | 299169.0 | 763136.0 | 3270901.0 | 942011.0 | 878518.0 | 375246.0 | 414189.0 | ... | 31577.0 | 4543.0 | 2956.0 | 2609964.0 | 114241.0 | 7499.0 | 456.0 | -1.0 | W | 0.718377 |
| 5825 | 1603290 | 2020-05-15 | HI | 384810.0 | 1035681.0 | 3631728.0 | 1158429.0 | 980166.0 | 406933.0 | 466745.0 | ... | 2059.0 | 770.0 | 1536.0 | 3012273.0 | 11638.0 | 2306.0 | 637.0 | -1.0 | W | 0.729101 |
| 6233 | 1625699 | 2020-05-23 | HI | 328401.0 | 1092090.0 | 3892433.0 | 1147837.0 | 1097663.0 | 458473.0 | 538047.0 | ... | 4843.0 | 2585.0 | 1856.0 | 3242020.0 | 23655.0 | 4441.0 | 642.0 | -5.0 | W | 0.768812 |
| 6432 | 1647899 | 2020-05-27 | CT | 796770.0 | 2775895.0 | 12693621.0 | 3370642.0 | 3231170.0 | 1516759.0 | 1813938.0 | ... | 44302.0 | 12586.0 | 13454.0 | 9932509.0 | 244193.0 | 26040.0 | 41288.0 | -15.0 | N | 0.776982 |
| 6965 | 1670433 | 2020-06-06 | MT | 280757.0 | 781548.0 | 2872045.0 | 768140.0 | 777777.0 | 327424.0 | 366157.0 | ... | 41590.0 | 8208.0 | 3900.0 | 2239498.0 | 138927.0 | 12108.0 | 540.0 | -1.0 | W | 0.735710 |
| 7542 | 1715012 | 2020-06-17 | VT | 134844.0 | 491455.0 | 2629558.0 | 619637.0 | 691093.0 | 337313.0 | 379879.0 | ... | 10854.0 | 1580.0 | 1078.0 | 2027922.0 | 50397.0 | 2658.0 | 1130.0 | -1.0 | N | 0.784697 |
| 7617 | 1715087 | 2020-06-19 | LA | 849318.0 | 3810660.0 | 16512982.0 | 4126671.0 | 4067245.0 | 2062082.0 | 2648260.0 | ... | 159947.0 | 27099.0 | 7958.0 | 12904258.0 | 525427.0 | 35057.0 | 48515.0 | -119.0 | S | 0.817742 |
| 9159 | 1826634 | 2020-07-19 | NJ | 2658546.0 | 6249974.0 | 24320193.0 | 7015985.0 | 6144137.0 | 2757541.0 | 3268146.0 | ... | 180580.0 | 37769.0 | 18265.0 | 19185809.0 | 748185.0 | 56034.0 | 176783.0 | -31.0 | N | 0.701573 |
| 9442 | 1826917 | 2020-07-25 | DE | 243739.0 | 723432.0 | 2768748.0 | 693641.0 | 663280.0 | 345560.0 | 440838.0 | ... | 20011.0 | 1404.0 | 1033.0 | 2143319.0 | 84577.0 | 2437.0 | 14175.0 | -27.0 | N | 0.747988 |
| 10301 | 1893779 | 2020-08-10 | WY | 134927.0 | 442810.0 | 2013099.0 | 501627.0 | 557454.0 | 254433.0 | 204098.0 | ... | 53801.0 | 5966.0 | 2012.0 | 1517612.0 | 155227.0 | 7978.0 | 3042.0 | -8.0 | W | 0.766456 |
| 10509 | 1893987 | 2020-08-15 | AR | 560038.0 | 2453787.0 | 8696740.0 | 1866058.0 | 2147444.0 | 1064206.0 | 1370135.0 | ... | 124019.0 | 27086.0 | 11234.0 | 6447843.0 | 382056.0 | 38320.0 | 51992.0 | -400.0 | S | 0.814177 |
| 10665 | 1916144 | 2020-08-18 | CT | 880430.0 | 2692235.0 | 9638996.0 | 2067296.0 | 2304484.0 | 1157634.0 | 1426094.0 | ... | 84063.0 | 25402.0 | 90550.0 | 6955508.0 | 359861.0 | 115952.0 | 51255.0 | -12.0 | N | 0.753565 |
| 11125 | 1938605 | 2020-08-27 | DE | 238663.0 | 728508.0 | 2710050.0 | 606253.0 | 649892.0 | 331172.0 | 440544.0 | ... | 23009.0 | 1958.0 | 2269.0 | 2027861.0 | 86199.0 | 4227.0 | 16976.0 | -10.0 | N | 0.753236 |
| 11278 | 1960759 | 2020-08-30 | DE | 260961.0 | 706210.0 | 2454317.0 | 570979.0 | 601312.0 | 304658.0 | 399750.0 | ... | 28053.0 | 2481.0 | 2990.0 | 1876699.0 | 92539.0 | 5471.0 | 17343.0 | -6.0 | N | 0.730181 |
| 11445 | 1960926 | 2020-09-02 | MA | 2049005.0 | 4853144.0 | 22759793.0 | 5421253.0 | 5881852.0 | 2707237.0 | 3370115.0 | ... | 80345.0 | 21336.0 | 43645.0 | 17380457.0 | 413647.0 | 64981.0 | 121131.0 | -7757.0 | N | 0.703135 |
| 12636 | 2028120 | 2020-09-25 | RI | 250814.0 | 806501.0 | 3637467.0 | 914413.0 | 952237.0 | 437962.0 | 550975.0 | ... | 10076.0 | 2443.0 | 1155.0 | 2855587.0 | 46675.0 | 3598.0 | 24181.0 | -130.0 | N | 0.762782 |
16 rows × 23 columns
#drop non positive values
df_combined = df_combined[(df_combined["positiveIncrease"] >= 0)]
df_combined[df_combined['positiveIncrease'] < 0]
| index | Date | State | Population Staying at Home | Population Not Staying at Home | nTrips | Number of Trips <1 | Number of Trips 1-3 | Number of Trips 3-5 | Number of Trips 5-10 | ... | Number of Trips 100-250 | Number of Trips 250-500 | Number of Trips >=500 | Number_of_Trips_1_25 | Number_of_Trips_25_250 | Number_of_Trips_250_ | positive | positiveIncrease | region | Ratio |
|---|
0 rows × 23 columns
Let's take a look an intial look at a national level of new COVID19 cases per day.
We will distinguish colors by regions for easy inference. Code inspired by https://stackoverflow.com/questions/26139423/plot-different-color-for-different-categorical-levels-using-matplotlib. Color inspired by https://in.pinterest.com/pin/773141461014013480/?d=t&mt=signup
colors = {'W':'#028174', 'N':'#0AB68B', 'S':'#92DE8B', 'M':'#FFE3B3'}
fig, ax = plt.subplots(figsize = (12, 8))
grouped = df_combined.groupby('region')
for key, group in grouped:
group.plot(ax=ax, kind='scatter', x='Date', y='positiveIncrease', label=key, color=colors[key], marker='o')
plt.xlabel('Date')
plt.ylabel('Positive Increase per Day')
ax.set_title('National Level of new COVID19 Cases per day')
plt.show()
In previous steps, we classified states into 4 groups: M, N, S, W--which stand for the Midwestern, Northern, Southern, and Western states of USA.
Immediately we can notice there are (unfortunately) three peaks, one early in the North region and then later in the South and West regions, and lastly what looks like all four regions. The North region also has one extremely negative outlier while South region has several positive outliers.
Below is a plotly interactive visualization of the daily COVID-19 cases in each state. Please don't forget to download and run this notebook to view it properly.
### colorscale
scl = [
[0, 'rgb(242,240,247)'],
[0.2, 'rgb(218,218,235)'],
[0.4, 'rgb(188,189,220)'],
[0.6, 'rgb(158,154,200)'],
[0.8, 'rgb(117,107,177)'],
[1.0, 'rgb(84,39,143)']
]
### create empty list for data object
data_slider = []
#populate the data object
for date in df_COVID19_state_wrangled.date.unique():
#select the data
df_positive_case = df_COVID19_state_wrangled[(df_COVID19_state_wrangled['date'] == date)]
for col in df_positive_case.columns: # I transform the columns into string type so I can:
df_positive_case[col] = df_positive_case[col].astype(str)
###create the text for mouse-hover for each state, for the current day
df_positive_case['text'] = df_positive_case['state'] + 'Positive Increase: '+ df_positive_case['positiveIncrease']
### create the dictionary with the data for the current date
data_date = dict(
type = 'choropleth',
locations = df_positive_case['state'],
z = df_positive_case['positiveIncrease'].astype(float),
locationmode = 'USA-states',
colorscale = scl,
text = df_positive_case['text'],
colorbar = {"title": "# of Cases"}
)
data_slider.append(data_date) # I add the dictionary to the list of dictionaries for the slider
#create the steps for the slider
steps = []
for i in range(len(data_slider)):
step = dict(method = 'restyle', args=['visible', [False] * len(data_slider)], label='Day {}'.format(i + 1)) # label to be displayed for each step (date)
step['args'][1][i] = True
steps.append(step)
#create the 'sliders' object from the 'steps'
sliders = [dict(active=0, pad={"t": 1}, steps = steps)]
#set up the layout (including slider option)
layout = go.Layout(
autosize = False, title = 'Number of Daily Positive COVID-19 Cases', width = 1000, height = 600,
geo = go.layout.Geo(
scope = 'usa',
projection = go.layout.geo.Projection(type = 'albers usa'),
showlakes = True,
lakecolor = 'rgb(255, 255, 255)'),
sliders=sliders,
)
#create the figure object
fig = dict(data = data_slider, layout = layout)
figure = go.FigureWidget(fig)
iplot(fig)
C:\Users\PC\miniconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py:23: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy C:\Users\PC\miniconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py:26: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Let's next look at how the total number of trips have changed over the same time range.
fig, ax = plt.subplots(figsize = (12, 8))
grouped = df_combined.groupby('State')
for key, group in grouped:
group.plot(ax=ax, kind='line', x='Date', y='nTrips', label=key, color=colors[regions[key]], marker='o')
plt.xlabel('Date')
plt.ylabel('Number of Trips')
ax.set_title('National Level of Trips per day')
#Removing legend as this was sorted by all 50 states
#Color scheme stays consistent as above however
plt.legend(loc="best", bbox_to_anchor=(1.05, 1))
plt.show()
The states in the same regional category (M, N, S, W) are sharing the same color. This is a little concerning. Aside from an initial decrease between March and April in total trips for certain states, there seems to be little to no change in total trips for the subsequent months while there were two peaks in COVID increases (ignoring the lone spike).
Again, another plotly interactive visualization of the number of daily travels in each state.
data_slider2 = []
#populate the data object
for travel_date in df_trips_by_distance_state.Date.unique():
#select the data
df_travel = df_trips_by_distance_state[(df_trips_by_distance_state['Date'] == travel_date)]
for col in df_travel.columns:
df_travel[col] = df_travel[col].astype(str)
#create the text for mouse-hover for each state, for the current year
df_travel['text'] = df_travel['State'] + 'Number of Travels: '+ df_travel['nTrips']
### create the dictionary with the data for the current date
data = dict(
type = 'choropleth',
locations = df_travel['State'],
z = df_travel['nTrips'].astype(float),
locationmode = 'USA-states',
colorscale = scl,
text = df_travel['text'],
colorbar = {"title": "# of Travels"}
)
data_slider2.append(data) # I add the dictionary to the list of dictionaries for the slider
#create the steps for the slider
steps2 = []
for i in range(len(data_slider2)):
step = dict(method = 'restyle', args=['visible', [False] * len(data_slider2)], label='Day {}'.format(i + 1)) # label to be displayed for each step (date)
step['args'][1][i] = True
steps2.append(step)
#create the 'sliders' object from the 'steps'
sliders = [dict(active=0, pad={"t": 1}, steps = steps2)]
#set up the layout (including slider option)
layout = go.Layout(
autosize = False, title = 'Number of Travels', width = 1000, height = 600,
geo = go.layout.Geo(
scope = 'usa',
projection = go.layout.geo.Projection(type = 'albers usa'),
showlakes = True,
lakecolor = 'rgb(255, 255, 255)'),
sliders=sliders,
)
#create the figure object
fig2 = dict(data = data_slider2, layout = layout)
#plot in the notebook
iplot(fig2)
C:\Users\PC\miniconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py:11: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy C:\Users\PC\miniconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py:14: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Because US states like California and Texas are relatively highly populous and hence have higher number of travels, this visualization doesn't depict the travels of other less populous states as intended. Hover over each state to see the accurate numbers increase or decrease.
Let's plot the relationship between Number of Trips and Positive Increase. If the Number of Trips doesn't change very much, but new positive cases does, then it will be hard to establish a relationship between the two.
fig, ax = plt.subplots(figsize = (12, 8))
grouped = df_combined.groupby('region')
for key, group in grouped:
group.plot(ax=ax, kind='scatter', x="nTrips", y='positiveIncrease', label=key, color=colors[key], marker='o')
plt.xlabel('Number of Trips')
plt.ylabel('Positive Increase per Day')
ax.set_title('National: Positive Increase X Number of Trips')
plt.show()
There seems to be some positive correlation, however there are clear differences depending even within regions. There is also the issue that at the far right of the graph where there are a lot of trips with zero new positive COVID cases. This is explained by how we see a large number of trips in the West early in the year, while the COVID cases only increased in the later half of the year. Another explanation for all the variability is due to the fact that not all states have an equal distribution in population and average trips, with more populous states (i.e. California) skewing the data. To avoid regional and national bias, we will now analyze everything at a state level.
As seen in the graph, there is extremely varying data regarding the relationship each state has with new COVID cases. Some of this is due to how most states only started to get COVID increases several months after we started tracking.
#Get state level correlation between positiveIncrease and Number of Trips, including distance
for state in state_list:
exec(f'df_combined_corr_ntrip_{state} = df_combined_{state}["nTrips"].corr(df_combined_{state}["positiveIncrease"])')
exec(f'df_combined_corr_1_25_{state} = df_combined_{state}["Number_of_Trips_1_25"].corr(df_combined_{state}["positiveIncrease"])')
exec(f'df_combined_corr_25_250_{state} = df_combined_{state}["Number_of_Trips_25_250"].corr(df_combined_{state}["positiveIncrease"])')
exec(f'df_combined_corr_250_{state} = df_combined_{state}["Number_of_Trips_250_"].corr(df_combined_{state}["positiveIncrease"])')
#create correlation lists
state_corr_ntrip = []
state_corr_1_25 = []
state_corr_25_250 = []
state_corr_250 = []
#collect state correlations into lists
state_corr_ntrip.extend(value for name, value in locals().items() if name.startswith("df_combined_corr_ntrip_"))
state_corr_1_25.extend(value for name, value in locals().items() if name.startswith("df_combined_corr_1_25_"))
state_corr_25_250.extend(value for name, value in locals().items() if name.startswith("df_combined_corr_25_250_"))
state_corr_250.extend(value for name, value in locals().items() if name.startswith("df_combined_corr_250_"))
#create correlation dataframes
df_corr_ntrip = pd.DataFrame(state_corr_ntrip, index = state_list, columns = ["Number of Trips"])
df_corr_1_25 = pd.DataFrame(state_corr_1_25, index = state_list, columns = ["Number of Trips (1-25)"])
df_corr_25_250 = pd.DataFrame(state_corr_25_250, index = state_list, columns = ["Number of Trips (25-250)"])
df_corr_250 = pd.DataFrame(state_corr_250, index = state_list, columns = ["Number of Trips (250+)"])
#merge dataframes
df_corr = pd.concat([df_corr_ntrip, df_corr_1_25], axis = 1)
df_corr = pd.concat([df_corr, df_corr_25_250], axis = 1)
df_corr = pd.concat([df_corr, df_corr_250], axis = 1)
df_corr
| Number of Trips | Number of Trips (1-25) | Number of Trips (25-250) | Number of Trips (250+) | |
|---|---|---|---|---|
| AL | -0.360377 | -0.426324 | 0.155837 | 0.275510 |
| AK | -0.455003 | -0.490661 | 0.366043 | 0.328184 |
| AZ | -0.384762 | -0.407092 | 0.091454 | 0.145421 |
| AR | -0.189026 | -0.257154 | 0.347377 | 0.256339 |
| CA | -0.522484 | -0.555735 | 0.133078 | 0.225204 |
| CO | -0.518621 | -0.517225 | -0.065125 | -0.020546 |
| CT | -0.218259 | -0.135210 | -0.605032 | -0.169249 |
| DC | -0.305811 | -0.275223 | -0.397344 | -0.176218 |
| DE | -0.173245 | -0.109212 | -0.308956 | -0.212701 |
| FL | -0.331047 | -0.380926 | 0.262362 | 0.240644 |
| GA | -0.308583 | -0.372355 | 0.351646 | 0.389667 |
| HI | -0.248937 | -0.280970 | -0.149829 | 0.269342 |
| ID | -0.394782 | -0.457760 | 0.439194 | 0.134214 |
| IL | -0.361639 | -0.370029 | -0.049315 | -0.013006 |
| IN | -0.303459 | -0.348967 | 0.093960 | 0.195888 |
| IA | -0.163708 | -0.210464 | 0.119247 | 0.059884 |
| KS | -0.139290 | -0.178622 | 0.079360 | 0.190708 |
| KY | -0.318558 | -0.397294 | 0.303427 | 0.370280 |
| LA | -0.392339 | -0.409881 | -0.106493 | 0.035080 |
| ME | 0.024389 | 0.079218 | -0.275599 | -0.236367 |
| MD | -0.330867 | -0.271953 | -0.110385 | -0.024578 |
| MA | -0.189742 | -0.108511 | -0.525567 | -0.175706 |
| MI | -0.509457 | -0.498383 | -0.202296 | 0.009166 |
| MN | -0.239824 | -0.299732 | 0.353533 | -0.022114 |
| MS | -0.281412 | -0.346470 | 0.100001 | 0.248008 |
| MO | -0.308496 | -0.395671 | 0.351028 | 0.344129 |
| MT | -0.381625 | -0.448547 | 0.480538 | 0.306800 |
| NE | -0.190124 | -0.223116 | 0.063817 | -0.088733 |
| NV | -0.507014 | -0.530355 | 0.199543 | 0.189703 |
| NH | -0.023073 | 0.088808 | -0.442573 | -0.274552 |
| NJ | -0.425592 | -0.324126 | -0.721688 | -0.218126 |
| NM | -0.426873 | -0.433760 | -0.016350 | 0.061673 |
| NY | -0.464693 | -0.391495 | -0.691374 | -0.199842 |
| NC | -0.369725 | -0.435051 | 0.367159 | 0.322382 |
| ND | -0.119612 | -0.144742 | 0.237676 | 0.140553 |
| OH | -0.269059 | -0.306532 | 0.136561 | 0.176971 |
| OK | -0.293833 | -0.368822 | 0.271480 | 0.351480 |
| OR | -0.484896 | -0.529442 | 0.374886 | 0.211426 |
| PA | -0.479848 | -0.435603 | -0.298230 | -0.151518 |
| RI | -0.238700 | -0.174357 | -0.543837 | -0.151374 |
| SC | -0.324562 | -0.372707 | 0.349843 | 0.328757 |
| SD | -0.137258 | -0.203171 | 0.315484 | 0.039377 |
| TN | -0.312880 | -0.386415 | 0.250456 | 0.383117 |
| TX | -0.358528 | -0.412714 | 0.039570 | 0.194234 |
| UT | -0.353090 | -0.384759 | 0.340288 | -0.030395 |
| VT | -0.481065 | -0.422474 | -0.426358 | 0.000490 |
| VA | -0.379207 | -0.393607 | 0.237814 | 0.152083 |
| WA | -0.393398 | -0.415347 | 0.032611 | 0.003309 |
| WV | -0.311729 | -0.374834 | 0.333096 | 0.310769 |
| WI | -0.240586 | -0.305525 | 0.285871 | 0.019948 |
| WY | -0.198778 | -0.263337 | 0.339287 | 0.178414 |
print("Ratio of Pos / Neg correlation in total number of trips: ", (df_corr['Number of Trips'] > 0).sum(), " / ", (df_corr['Number of Trips'] < 0).sum())
print("Ratio of Pos / Neg correlation in trips between 1-25 miles: ", (df_corr['Number of Trips (1-25)'] > 0).sum(), " / ", (df_corr['Number of Trips (1-25)'] < 0).sum())
print("Ratio of Pos / Neg correlation in trips between 25-250 miles: ", (df_corr['Number of Trips (25-250)'] > 0).sum(), " / ", (df_corr['Number of Trips (25-250)'] < 0).sum())
print("Ratio of Pos / Neg correlation in trips above 250 miles: ", (df_corr['Number of Trips (250+)'] > 0).sum(), " / ", (df_corr['Number of Trips (250+)'] < 0).sum())
Ratio of Pos / Neg correlation in total number of trips: 1 / 50 Ratio of Pos / Neg correlation in trips between 1-25 miles: 2 / 49 Ratio of Pos / Neg correlation in trips between 25-250 miles: 33 / 18 Ratio of Pos / Neg correlation in trips above 250 miles: 35 / 16
Interestingly, many states have an overall negative correlation with total trips, but at a distance of 25-250 miles the majority of states have a positive correlation for trips, with some of the positive correlation lingering in longer trips.
One of the challenges for our hypothesis is that each state exhibits different seasonality in COVID-19 cases, which is expected considering the North East was an early hotspot for the virus. This means that we will have to analyze everything at a state level and compare accuracy scores at an aggregate level.
Total trips cannot be included as a predictor since it has perfect collinearity with the other categorical variables
#Create arrays for storing results
params = []
pvals = []
rsquared = []
#Repeat for every state
for state in state_list:
#Create OLS model and fit
exec(f'outcome1_{state}, predictors1_{state} = patsy.dmatrices("positiveIncrease ~ Number_of_Trips_1_25 + Number_of_Trips_25_250 + Number_of_Trips_250_", df_combined_{state})')
exec(f'mod1_{state} = sm.OLS(outcome1_{state}, predictors1_{state})')
exec(f'res1_{state} = mod1_{state}.fit()')
#Store params
exec(f'params.append(res1_{state}.params)')
exec(f'pvals.append(res1_{state}.pvalues)')
exec(f'rsquared.append(res1_{state}.rsquared)')
#Create results dataframes
res1_params = pd.DataFrame(params, index = state_list, columns = ['Intercept','1-25','25-250','250+'])
res1_pvals = pd.DataFrame(pvals, index = state_list, columns = ['pIntercept','p(1-25)','p(25-250)','p(250+)'])
res1_rsquareds = pd.DataFrame(rsquared, index = state_list, columns = ['rSquared'])
#Merge results
df_res1 = pd.concat([res1_params, res1_pvals, res1_rsquareds],axis=1)
df_res1
| Intercept | 1-25 | 25-250 | 250+ | pIntercept | p(1-25) | p(25-250) | p(250+) | rSquared | |
|---|---|---|---|---|---|---|---|---|---|
| AL | 2215.363143 | -0.000263 | 0.002501 | 0.002672 | 7.027147e-12 | 1.914603e-16 | 1.780201e-04 | 4.743904e-01 | 0.289774 |
| AK | 86.753468 | -0.000067 | 0.000902 | 0.002262 | 3.450568e-07 | 5.292162e-18 | 1.170982e-04 | 1.319544e-05 | 0.380804 |
| AZ | 2408.175112 | -0.000199 | 0.001904 | 0.005397 | 1.858463e-05 | 1.598685e-13 | 8.409203e-02 | 9.611261e-02 | 0.207119 |
| AR | 459.316353 | -0.000192 | 0.003855 | -0.000864 | 1.548334e-02 | 5.059639e-12 | 1.060214e-10 | 7.932670e-01 | 0.270753 |
| CA | 4673.264120 | -0.000118 | 0.002107 | 0.003688 | 7.244794e-06 | 5.993251e-34 | 1.010156e-06 | 1.443534e-02 | 0.463574 |
| CO | 1000.852759 | -0.000052 | 0.000241 | -0.000750 | 1.132133e-22 | 1.438388e-19 | 1.250501e-01 | 1.365346e-01 | 0.275219 |
| CT | 1238.645344 | -0.000016 | -0.003751 | 0.002228 | 3.017014e-15 | 3.146171e-01 | 2.104693e-25 | 3.776884e-02 | 0.384280 |
| DC | 247.521363 | -0.000039 | -0.005323 | -0.000641 | 1.558162e-23 | 7.959329e-07 | 1.857887e-08 | 2.810246e-01 | 0.233749 |
| DE | 260.416440 | -0.000034 | -0.001120 | -0.004843 | 1.600608e-08 | 5.789047e-02 | 1.190632e-03 | 8.421670e-02 | 0.112054 |
| FL | 8377.094575 | -0.000324 | 0.008887 | -0.006146 | 2.909320e-07 | 1.581982e-19 | 6.870853e-12 | 6.036705e-02 | 0.327880 |
| GA | 2216.512257 | -0.000188 | 0.003328 | 0.001812 | 1.116471e-03 | 1.030215e-12 | 6.107958e-08 | 2.667115e-01 | 0.321910 |
| HI | 160.919228 | -0.000036 | -0.004226 | 0.013201 | 1.559823e-11 | 8.141918e-04 | 9.613466e-04 | 6.286413e-14 | 0.262693 |
| ID | 197.207352 | -0.000152 | 0.003342 | -0.001882 | 3.759665e-02 | 8.220313e-17 | 8.406170e-15 | 1.553220e-01 | 0.387392 |
| IL | 3434.323649 | -0.000093 | 0.000814 | -0.000272 | 5.550026e-12 | 1.419419e-10 | 7.986723e-02 | 8.087281e-01 | 0.149449 |
| IN | 1386.553332 | -0.000074 | 0.000388 | 0.001451 | 2.643383e-10 | 3.842576e-09 | 1.499177e-01 | 1.915191e-01 | 0.160966 |
| IA | 647.066325 | -0.000099 | 0.001482 | -0.003902 | 1.172823e-04 | 4.384446e-06 | 5.112102e-04 | 7.381565e-02 | 0.092731 |
| KS | 536.636170 | -0.000086 | 0.000240 | 0.005930 | 8.950598e-03 | 3.639090e-03 | 7.410205e-01 | 4.124431e-02 | 0.072655 |
| KY | 749.887970 | -0.000094 | 0.001267 | 0.000257 | 2.455197e-06 | 4.747034e-12 | 9.534758e-06 | 8.373507e-01 | 0.287769 |
| LA | 2850.771774 | -0.000202 | 0.000326 | 0.002285 | 4.193635e-14 | 7.865940e-11 | 6.428590e-01 | 5.483468e-01 | 0.172606 |
| ME | 29.860057 | 0.000005 | -0.000182 | -0.000257 | 5.970849e-04 | 5.814029e-02 | 7.579851e-04 | 4.118951e-01 | 0.097823 |
| MD | 1449.657422 | -0.000053 | -0.000431 | -0.000534 | 1.088891e-11 | 7.303035e-06 | 2.247599e-01 | 5.610082e-01 | 0.088147 |
| MA | 2683.265370 | -0.000009 | -0.004814 | 0.001646 | 2.891024e-12 | 5.788853e-01 | 2.359691e-17 | 2.288603e-01 | 0.282801 |
| MI | 2481.710667 | -0.000070 | -0.000432 | 0.000992 | 3.389147e-26 | 2.121122e-15 | 5.566847e-03 | 1.402715e-01 | 0.270789 |
| MN | 435.094008 | -0.000069 | 0.001598 | -0.003245 | 3.420531e-03 | 1.406896e-10 | 6.992489e-17 | 6.915005e-05 | 0.305717 |
| MS | 1322.273487 | -0.000238 | 0.001559 | 0.003566 | 1.121106e-08 | 1.446348e-09 | 1.770298e-02 | 3.184552e-01 | 0.191779 |
| MO | 1005.229530 | -0.000171 | 0.002804 | 0.003313 | 1.356926e-04 | 3.692986e-21 | 1.850605e-10 | 2.739507e-02 | 0.398899 |
| MT | 60.036360 | -0.000086 | 0.001249 | 0.000952 | 5.494174e-02 | 9.459384e-17 | 1.932077e-13 | 2.406373e-01 | 0.415472 |
| NE | 330.046686 | -0.000077 | 0.001484 | -0.006436 | 2.940990e-05 | 1.089502e-06 | 1.042613e-05 | 8.890913e-05 | 0.122799 |
| NV | 658.619512 | -0.000170 | 0.003253 | 0.004965 | 1.355040e-05 | 2.001372e-27 | 1.223418e-05 | 6.978765e-04 | 0.400039 |
| NH | 60.542223 | 0.000013 | -0.000569 | -0.000131 | 1.083572e-03 | 1.156232e-02 | 2.953768e-11 | 8.340419e-01 | 0.220843 |
| NJ | 6325.082202 | -0.000067 | -0.006345 | 0.004052 | 6.362835e-44 | 5.397833e-06 | 1.628437e-40 | 1.874828e-03 | 0.593551 |
| NM | 325.836599 | -0.000053 | 0.000324 | -0.000189 | 4.222686e-11 | 5.595548e-14 | 6.597717e-02 | 8.124053e-01 | 0.203068 |
| NY | 15150.875110 | -0.000082 | -0.007348 | 0.003051 | 1.810326e-47 | 4.494628e-09 | 9.966607e-36 | 5.635590e-03 | 0.574436 |
| NC | 2065.032828 | -0.000122 | 0.001903 | -0.000743 | 1.201322e-06 | 2.103220e-15 | 2.569237e-08 | 5.763342e-01 | 0.330341 |
| ND | 37.380995 | -0.000157 | 0.002724 | -0.000776 | 4.745146e-01 | 3.156811e-07 | 5.099999e-08 | 7.074142e-01 | 0.148319 |
| OH | 1598.654136 | -0.000058 | 0.000742 | -0.000247 | 6.795081e-09 | 8.462234e-09 | 2.743751e-03 | 7.783041e-01 | 0.148482 |
| OK | 796.723969 | -0.000199 | 0.003214 | 0.003991 | 5.350044e-07 | 1.819962e-22 | 1.088892e-10 | 1.255483e-01 | 0.395913 |
| OR | 243.470620 | -0.000040 | 0.000843 | -0.000352 | 1.366190e-06 | 4.233999e-24 | 4.903996e-10 | 5.398154e-01 | 0.423144 |
| PA | 2754.838068 | -0.000050 | -0.000480 | -0.000948 | 5.223382e-27 | 2.022439e-13 | 1.697791e-03 | 2.193607e-01 | 0.261123 |
| RI | 400.826173 | -0.000037 | -0.004326 | 0.004072 | 5.157472e-16 | 1.906604e-02 | 3.319648e-21 | 8.004567e-03 | 0.339120 |
| SC | 1351.374152 | -0.000187 | 0.003618 | -0.001879 | 2.245018e-04 | 1.455096e-11 | 9.389454e-08 | 4.718567e-01 | 0.278381 |
| SD | 96.566055 | -0.000103 | 0.001972 | -0.004953 | 6.081339e-02 | 1.547719e-06 | 1.231497e-11 | 2.743222e-03 | 0.199953 |
| TN | 2289.707170 | -0.000201 | 0.002502 | 0.002612 | 1.608616e-07 | 3.794027e-12 | 2.271371e-04 | 2.248781e-01 | 0.292052 |
| TX | 8739.039870 | -0.000231 | 0.003195 | 0.003171 | 1.668517e-08 | 2.329796e-17 | 2.705374e-04 | 1.613477e-01 | 0.279953 |
| UT | 308.578585 | -0.000080 | 0.002295 | -0.002924 | 1.603983e-02 | 2.064540e-12 | 1.015922e-12 | 3.338193e-04 | 0.300609 |
| VT | 46.912976 | -0.000013 | -0.000465 | 0.000863 | 3.031286e-27 | 6.145226e-09 | 1.760817e-14 | 1.193071e-03 | 0.353308 |
| VA | 1369.220940 | -0.000065 | 0.000837 | -0.001210 | 6.143550e-08 | 3.806249e-11 | 2.412758e-04 | 1.989990e-01 | 0.203329 |
| WA | 938.609637 | -0.000044 | 0.000366 | -0.000564 | 6.456963e-11 | 1.100979e-12 | 1.490299e-01 | 4.294538e-01 | 0.179241 |
| WV | 137.605324 | -0.000055 | 0.000830 | -0.000041 | 2.162398e-04 | 9.721337e-12 | 3.349569e-07 | 9.718694e-01 | 0.272203 |
| WI | 875.758601 | -0.000125 | 0.002558 | -0.006602 | 3.641149e-04 | 1.459624e-12 | 3.765109e-14 | 2.080760e-04 | 0.278237 |
| WY | 27.128636 | -0.000045 | 0.000565 | -0.000555 | 4.053672e-02 | 2.523946e-09 | 1.140678e-10 | 4.408612e-01 | 0.229155 |
print('Average Accuracy: ', df_res1['rSquared'].mean())
print('Best Accuracy: ', df_res1['rSquared'].max())
print('Worst Accuracy: ', df_res1['rSquared'].min())
Average Accuracy: 0.27318447469125234 Best Accuracy: 0.5935510167458291 Worst Accuracy: 0.07265548637166253
As seen above, the majority of the models fail to capture any meaningful relationship, and the ones with moderate accuracy (i.e. WV) have extremely low coefficients on the predictors.
Since trips threshold doesn't make a difference, we will consolidate them into one variable. We will also introduce the ratio variable which calulates the proportion of a state's population that are not staying home.
#Reset arrays for storing results
params = []
pvals = []
rsquared = []
#Repeat for every state
for state in state_list:
#Create OLS model and fit
exec(f'outcome2_{state}, predictors2_{state} = patsy.dmatrices("positiveIncrease ~ nTrips + Ratio", df_combined_{state})')
exec(f'mod2_{state} = sm.OLS(outcome2_{state}, predictors2_{state})')
exec(f'res2_{state} = mod2_{state}.fit()')
#Store params
exec(f'params.append(res2_{state}.params)')
exec(f'pvals.append(res2_{state}.pvalues)')
exec(f'rsquared.append(res2_{state}.rsquared)')
#Create results dataframes
res2_params = pd.DataFrame(params, index = state_list, columns = ['Intercept','nTrips','Ratio'])
res2_pvals = pd.DataFrame(pvals, index = state_list, columns = ['pIntercept','pnTrips','pRatio'])
res2_rsquareds = pd.DataFrame(rsquared, index = state_list, columns = ['rSquared'])
#Merge results
df_res2 = pd.concat([res2_params, res2_pvals, res2_rsquareds],axis=1)
df_res2
| Intercept | nTrips | Ratio | pIntercept | pnTrips | pRatio | rSquared | |
|---|---|---|---|---|---|---|---|
| AL | -6206.067396 | -3.158443e-04 | 14061.186036 | 5.829372e-05 | 6.360590e-16 | 1.091200e-08 | 0.232804 |
| AK | -262.430155 | -1.225040e-04 | 767.911594 | 5.380591e-03 | 1.201151e-13 | 8.867670e-06 | 0.265114 |
| AZ | -8261.986293 | -3.772425e-04 | 22091.509918 | 3.121746e-04 | 3.246408e-13 | 2.604935e-07 | 0.230814 |
| AR | -4021.044941 | -2.201022e-04 | 7821.260117 | 6.887668e-04 | 2.018852e-07 | 2.484586e-05 | 0.099573 |
| CA | -23184.640618 | -1.756746e-04 | 59814.593279 | 4.017422e-05 | 4.314918e-19 | 3.555869e-09 | 0.364358 |
| CO | 899.932812 | -4.322389e-05 | 300.304298 | 8.226369e-03 | 3.808993e-06 | 6.524281e-01 | 0.269539 |
| CT | 4329.584447 | 8.111408e-05 | -6655.105286 | 2.367820e-10 | 3.955322e-03 | 8.080790e-08 | 0.147633 |
| DC | 536.771378 | 2.156722e-05 | -821.025552 | 1.029628e-16 | 6.526454e-02 | 2.819016e-10 | 0.222397 |
| DE | 891.706468 | 4.096621e-05 | -1230.332990 | 2.766965e-08 | 5.724698e-02 | 5.493257e-06 | 0.104234 |
| FL | -17560.751303 | -3.651776e-04 | 55524.973369 | 1.338690e-02 | 5.030601e-10 | 2.410725e-05 | 0.168728 |
| GA | 2281.922219 | -1.336035e-04 | 4211.226504 | 3.223414e-01 | 2.822052e-05 | 2.696327e-01 | 0.099462 |
| HI | 801.877256 | 2.748539e-05 | -1158.651628 | 1.865190e-05 | 1.088466e-01 | 4.178782e-04 | 0.105912 |
| ID | -1199.040407 | -2.493702e-04 | 3401.773199 | 2.133514e-03 | 1.253265e-14 | 6.064157e-07 | 0.233062 |
| IL | 2842.428550 | -7.198131e-05 | 1811.255249 | 1.533640e-01 | 3.288755e-03 | 6.395288e-01 | 0.131517 |
| IN | 769.634128 | -6.122432e-05 | 1355.089992 | 2.868264e-01 | 5.419476e-04 | 3.097450e-01 | 0.095690 |
| IA | 693.022078 | -4.399513e-05 | 102.069872 | 3.800095e-01 | 1.837775e-01 | 9.415695e-01 | 0.026821 |
| KS | -353.966965 | -9.188106e-05 | 1883.231050 | 7.066821e-01 | 4.674582e-02 | 2.743257e-01 | 0.023907 |
| KY | -552.517906 | -9.790070e-05 | 2997.014878 | 3.570673e-01 | 2.173442e-07 | 5.090930e-03 | 0.128242 |
| LA | -3593.937290 | -2.730431e-04 | 10738.736497 | 8.040082e-02 | 2.584448e-09 | 1.394563e-03 | 0.186581 |
| ME | 237.379911 | 1.986295e-05 | -395.613734 | 2.088192e-11 | 1.532761e-08 | 1.373947e-10 | 0.147310 |
| MD | 4371.286341 | 1.152444e-05 | -5424.659038 | 2.325667e-09 | 4.810656e-01 | 2.294081e-05 | 0.168919 |
| MA | 9683.650929 | 1.012560e-04 | -16127.524986 | 3.204031e-10 | 7.779962e-04 | 6.025209e-08 | 0.139108 |
| MI | 6395.750866 | 1.667706e-05 | -8447.027799 | 1.598490e-18 | 2.022929e-01 | 1.436351e-09 | 0.356994 |
| MN | -1357.888808 | -9.431055e-05 | 4380.262896 | 7.778195e-02 | 1.131165e-05 | 2.244034e-03 | 0.090828 |
| MS | -5375.664345 | -3.071812e-04 | 10304.429902 | 2.427589e-05 | 2.216909e-12 | 8.656893e-08 | 0.175452 |
| MO | -2553.990739 | -1.791790e-04 | 8365.478860 | 4.766080e-02 | 6.655048e-08 | 4.608076e-04 | 0.136950 |
| MT | -285.712873 | -1.301032e-04 | 987.315066 | 1.750849e-02 | 1.052142e-13 | 4.964041e-06 | 0.211600 |
| NE | 361.368951 | -4.105850e-05 | 72.070976 | 3.504717e-01 | 1.034656e-01 | 9.138391e-01 | 0.036190 |
| NV | -2584.018160 | -2.948437e-04 | 7341.759675 | 3.664175e-05 | 2.447723e-20 | 5.154413e-10 | 0.359789 |
| NH | 450.049531 | 3.127571e-05 | -729.360874 | 1.274448e-09 | 4.139255e-06 | 8.563638e-09 | 0.120364 |
| NJ | 14784.448487 | 6.487027e-05 | -21730.748641 | 9.099827e-21 | 1.956751e-02 | 5.807679e-13 | 0.329589 |
| NM | -499.000605 | -8.319583e-05 | 1602.148185 | 2.280441e-02 | 4.824652e-12 | 3.604907e-05 | 0.234279 |
| NY | 32033.571735 | 5.243076e-05 | -49064.658117 | 2.786639e-18 | 7.820905e-02 | 1.854662e-10 | 0.329528 |
| NC | -2609.625113 | -1.427149e-04 | 10343.525918 | 8.177223e-02 | 1.000349e-12 | 3.287359e-05 | 0.192199 |
| ND | 754.923534 | 4.285965e-05 | -1005.504952 | 4.560419e-03 | 3.516491e-01 | 2.722763e-02 | 0.032656 |
| OH | 20.956437 | -5.525199e-05 | 3434.221330 | 9.836232e-01 | 5.178109e-04 | 8.536490e-02 | 0.082912 |
| OK | -1717.859898 | -1.657821e-04 | 5175.239983 | 7.958487e-02 | 1.852072e-06 | 2.743235e-03 | 0.117383 |
| OR | -531.594966 | -6.270491e-05 | 1970.185636 | 1.755088e-02 | 1.736242e-15 | 2.889022e-06 | 0.296995 |
| PA | 6411.498171 | -3.126745e-07 | -7745.194566 | 2.983251e-12 | 9.771143e-01 | 4.327737e-06 | 0.290403 |
| RI | 1583.574412 | 1.060544e-04 | -2450.434150 | 1.322192e-12 | 4.069070e-04 | 1.782346e-09 | 0.179756 |
| SC | 461.983464 | -1.577032e-04 | 3346.386932 | 7.229485e-01 | 1.197912e-06 | 1.037757e-01 | 0.114418 |
| SD | 469.587178 | -1.426240e-05 | -443.793798 | 5.631126e-02 | 6.462266e-01 | 2.649028e-01 | 0.023528 |
| TN | -6806.414333 | -2.780056e-04 | 17196.995005 | 4.219989e-04 | 1.665292e-12 | 2.379829e-07 | 0.186086 |
| TX | -41512.443260 | -3.124276e-04 | 88584.220818 | 6.954736e-06 | 1.685499e-14 | 9.772323e-09 | 0.232267 |
| UT | -751.747561 | -1.198485e-04 | 3018.261645 | 1.756059e-01 | 8.235817e-08 | 2.568522e-03 | 0.154807 |
| VT | 92.033356 | -8.416029e-06 | -88.528343 | 7.985314e-08 | 2.341486e-03 | 1.426904e-03 | 0.260964 |
| VA | 1634.817950 | -5.544185e-05 | 582.327211 | 6.826305e-02 | 1.174114e-04 | 7.085285e-01 | 0.144259 |
| WA | -195.698032 | -5.590747e-05 | 2343.130160 | 7.457906e-01 | 1.768070e-06 | 3.430032e-02 | 0.169228 |
| WV | -36.970084 | -5.252984e-05 | 507.072063 | 7.911841e-01 | 1.383909e-06 | 3.474690e-02 | 0.112551 |
| WI | 470.084886 | -7.951583e-05 | 1846.577828 | 7.296949e-01 | 1.973659e-02 | 4.520105e-01 | 0.059933 |
| WY | -90.438512 | -4.694180e-05 | 272.291839 | 1.165237e-01 | 3.143686e-05 | 5.518596e-03 | 0.067598 |
print('Average Accuracy: ', df_res2['rSquared'].mean())
print('Best Accuracy: ', df_res2['rSquared'].max())
print('Worst Accuracy: ', df_res2['rSquared'].min())
Average Accuracy: 0.17041627095376163 Best Accuracy: 0.36435789488902015 Worst Accuracy: 0.02352801072313948
This model performed only marginally better than the previous one. The coefficients on number of trips are still close to zero. What is surprising however is that ratio of people not staying home has a negative effect on positive increases in cases.
fig, ax = plt.subplots(figsize = (12, 8))
grouped = df_combined.groupby('State')
for key, group in grouped:
group.plot(ax=ax, kind='line', x='Date', y='Ratio', label=key, color=colors[regions[key]], marker='o')
plt.xlabel('Date')
plt.ylabel('Ratio')
ax.set_title('Ratio of Population not Staying at home per day and total trips')
plt.legend(loc="best", bbox_to_anchor=(1.05, 1))
plt.show()
The states in the same regional category (M, N, S, W) are sharing the same color. There is a general negative decrease in ratio not staying at home while positive cases are increasing which could explain why the coefficient is negative. Regardless, this model is also unsatisfactory in establish a correlation.
data_slider3 = []
#populate the data object
for travel_date in df_trips_by_distance_state.Date.unique():
#select the data
df_travel = df_trips_by_distance_state[(df_trips_by_distance_state['Date'] == travel_date)]
for col in df_travel.columns:
df_travel[col] = df_travel[col].astype(str)
#create the text for mouse-hover for each state, for the current year
df_travel['text'] = df_travel['State'] + 'Number of Population Not Staying at Home: '+ df_travel['Population Not Staying at Home']
### create the dictionary with the data for the current date
data = dict(
type = 'choropleth',
locations = df_travel['State'],
z = df_travel['Population Not Staying at Home'].astype(float),
locationmode = 'USA-states',
colorscale = scl,
text = df_travel['text'],
colorbar = {"title": "# of Population"}
)
data_slider3.append(data) #add the dictionary to the list of dictionaries for the slider
#create the steps for the slider
steps3 = []
for i in range(len(data_slider3)):
step = dict(method = 'restyle', args=['visible', [False] * len(data_slider3)], label='Day {}'.format(i + 1)) # label to be displayed for each step (date)
step['args'][1][i] = True
steps3.append(step)
#create the 'sliders' object from the 'steps'
sliders = [dict(active=0, pad={"t": 1}, steps = steps3)]
#set up the layout (including slider option)
layout = go.Layout(
autosize = False, title = 'Number of Population Not Staying at Home', width = 1000, height = 600,
geo = go.layout.Geo(
scope = 'usa',
projection = go.layout.geo.Projection(type = 'albers usa'),
showlakes = True,
lakecolor = 'rgb(255, 255, 255)'),
sliders=sliders,
)
#create the figure object
fig3 = dict(data = data_slider3, layout = layout)
#plot in the notebook
iplot(fig3)
C:\Users\PC\miniconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py:11: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy C:\Users\PC\miniconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py:14: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Again, the relatively highly populous states make the less populous states' data look static. Hover over each state to see the accurate numbers increase or decrease.
Since the data has a heavy reliance on time, we will consider it like a stock price analysis, where today's gains are have some dependence on yesterday's gains.
#Reset arrays for storing results
params = []
pvals = []
rsquared = []
#Repeat for every state
for state in state_list:
#Copy new dataframe for time series analysis
exec(f'ts_{state} = df_combined_{state}[["Date","positiveIncrease","State","Ratio","nTrips"]].copy()')
#Add yesterdays positiveIncreases as a predictor
exec(f'ts_{state}["Yesterday"]=ts_{state}["positiveIncrease"].shift(1)')
#Create OLS model
exec(f'outcome3_{state}, predictors3_{state} = patsy.dmatrices("positiveIncrease ~ nTrips + Ratio + Yesterday", ts_{state})')
exec(f'mod3_{state} = sm.OLS(outcome3_{state}, predictors3_{state})')
exec(f'res3_{state} = mod3_{state}.fit()')
#Store params
exec(f'params.append(res3_{state}.params)')
exec(f'pvals.append(res3_{state}.pvalues)')
exec(f'rsquared.append(res3_{state}.rsquared)')
#Create results dataframes
res3_params = pd.DataFrame(params, index = state_list, columns = ['Intercept','nTrips','Ratio','Yesterday'])
res3_pvals = pd.DataFrame(pvals, index = state_list, columns = ['pIntercept','pnTrips','pRatio','pYesterday'])
res3_rsquareds = pd.DataFrame(rsquared, index = state_list, columns = ['rSquared'])
#Merge results
df_res3 = pd.concat([res3_params, res3_pvals, res3_rsquareds],axis=1)
df_res3
| Intercept | nTrips | Ratio | Yesterday | pIntercept | pnTrips | pRatio | pYesterday | rSquared | |
|---|---|---|---|---|---|---|---|---|---|
| AL | -1828.793051 | -8.272529e-05 | 3946.861987 | 0.721172 | 0.097404 | 0.005189 | 0.027606 | 6.542173e-42 | 0.624130 |
| AK | -145.284945 | -6.152573e-05 | 403.839463 | 0.509201 | 0.077010 | 0.000059 | 0.008228 | 2.374574e-18 | 0.453371 |
| AZ | -2493.707133 | -7.191595e-05 | 5417.446835 | 0.838287 | 0.051590 | 0.017085 | 0.025581 | 1.213139e-68 | 0.765637 |
| AR | -2106.900148 | -1.163155e-04 | 4109.549565 | 0.477105 | 0.046944 | 0.002582 | 0.014065 | 3.228960e-16 | 0.305611 |
| CA | -2941.180732 | -1.997463e-05 | 7209.614404 | 0.887676 | 0.272443 | 0.044483 | 0.140804 | 8.375765e-88 | 0.862140 |
| CO | 482.254774 | -9.400432e-06 | -282.747695 | 0.669200 | 0.059416 | 0.196149 | 0.572437 | 1.063607e-34 | 0.591320 |
| CT | 2700.494901 | 4.560703e-05 | -4073.888026 | 0.367193 | 0.000064 | 0.088236 | 0.000763 | 9.539370e-10 | 0.261930 |
| DC | 170.098944 | 7.763403e-06 | -264.855262 | 0.694795 | 0.000659 | 0.357921 | 0.006590 | 3.100522e-39 | 0.599579 |
| DE | 584.657620 | 2.611916e-05 | -802.751876 | 0.336354 | 0.000227 | 0.203822 | 0.002356 | 2.779755e-08 | 0.204191 |
| FL | 81.960767 | 1.312187e-05 | -1022.019427 | 0.957937 | 0.973529 | 0.538590 | 0.825750 | 1.136042e-121 | 0.901640 |
| GA | -558.400949 | 1.350562e-05 | 236.498653 | 0.942517 | 0.540248 | 0.298332 | 0.875168 | 1.034203e-106 | 0.860760 |
| HI | 194.565128 | 4.682908e-06 | -269.925921 | 0.735439 | 0.135117 | 0.689337 | 0.233822 | 1.461775e-45 | 0.589401 |
| ID | -133.877582 | -1.712283e-05 | 312.068279 | 0.877577 | 0.511385 | 0.342922 | 0.389937 | 2.035335e-76 | 0.796459 |
| IL | 824.184808 | 2.300819e-06 | -986.361687 | 0.833336 | 0.477683 | 0.873925 | 0.661674 | 5.274512e-63 | 0.707064 |
| IN | 151.972640 | 6.040092e-06 | -306.634283 | 0.888107 | 0.672843 | 0.504076 | 0.645199 | 1.807337e-80 | 0.776494 |
| IA | -23.412356 | -4.189529e-06 | 168.199020 | 0.803085 | 0.960960 | 0.834595 | 0.841720 | 1.344637e-58 | 0.645843 |
| KS | -211.684564 | -1.011573e-04 | 1887.396149 | -0.268409 | 0.816404 | 0.023828 | 0.257850 | 1.001315e-05 | 0.095072 |
| KY | -160.724609 | -2.196942e-05 | 747.625781 | 0.692420 | 0.716831 | 0.131385 | 0.349236 | 2.397253e-36 | 0.527720 |
| LA | -2670.553734 | -1.839392e-04 | 7498.503981 | 0.392012 | 0.158329 | 0.000022 | 0.015953 | 1.354349e-11 | 0.318308 |
| ME | 86.537527 | 8.892228e-06 | -153.742472 | 0.653964 | 0.001884 | 0.001009 | 0.001392 | 4.785050e-34 | 0.520875 |
| MD | 721.774676 | 7.807182e-06 | -1049.330215 | 0.843106 | 0.084851 | 0.386846 | 0.144239 | 4.977578e-69 | 0.747873 |
| MA | 5015.497877 | 5.594801e-05 | -8470.804955 | 0.475087 | 0.000477 | 0.038363 | 0.001951 | 7.055184e-16 | 0.330672 |
| MI | 2948.433286 | 1.080491e-05 | -4016.312564 | 0.529551 | 0.000023 | 0.339102 | 0.001467 | 8.450527e-19 | 0.523653 |
| MN | -619.337012 | -2.294205e-05 | 1405.919031 | 0.839213 | 0.139254 | 0.052222 | 0.072525 | 5.534211e-71 | 0.734857 |
| MS | -1627.454180 | -7.863876e-05 | 2976.945547 | 0.671816 | 0.100157 | 0.028090 | 0.050289 | 3.901686e-33 | 0.527816 |
| MO | -469.628367 | -1.730118e-05 | 1138.365730 | 0.843954 | 0.517804 | 0.370213 | 0.402084 | 2.616304e-67 | 0.731149 |
| MT | -45.744444 | -5.429528e-06 | 92.293456 | 0.872608 | 0.483735 | 0.596820 | 0.441588 | 1.987787e-71 | 0.771468 |
| NE | 146.875716 | 9.224634e-06 | -225.854534 | 0.823430 | 0.515142 | 0.533955 | 0.561062 | 1.107647e-62 | 0.673858 |
| NV | -597.855568 | -4.089399e-05 | 1333.626875 | 0.889325 | 0.041954 | 0.010542 | 0.018308 | 1.310204e-88 | 0.863232 |
| NH | 336.520551 | 2.431497e-05 | -549.717493 | 0.224288 | 0.000016 | 0.000437 | 0.000032 | 3.401029e-04 | 0.163758 |
| NJ | 2108.000654 | 1.426702e-05 | -3336.003161 | 0.888508 | 0.005091 | 0.245485 | 0.016269 | 2.297282e-94 | 0.870881 |
| NM | -92.675778 | -1.735192e-05 | 324.690568 | 0.753420 | 0.532053 | 0.043530 | 0.222937 | 6.614233e-47 | 0.655574 |
| NY | 2487.598679 | 1.842568e-05 | -5335.017911 | 0.952477 | 0.046749 | 0.052166 | 0.035099 | 6.015303e-131 | 0.932789 |
| NC | -1645.485594 | -3.135755e-05 | 3658.472330 | 0.763026 | 0.099235 | 0.026614 | 0.029136 | 4.418432e-48 | 0.644608 |
| ND | 102.476987 | 3.163613e-05 | -216.128492 | 0.904506 | 0.381747 | 0.116803 | 0.279222 | 4.283821e-95 | 0.816317 |
| OH | -190.055184 | 1.344266e-06 | 224.383572 | 0.945266 | 0.606358 | 0.818195 | 0.756318 | 1.978198e-116 | 0.880661 |
| OK | -81.209199 | 8.723583e-06 | 7.760140 | 0.909162 | 0.861291 | 0.609318 | 0.992515 | 1.902249e-86 | 0.804486 |
| OR | -157.905373 | -1.243569e-05 | 462.139127 | 0.804606 | 0.247188 | 0.015145 | 0.076585 | 2.540507e-58 | 0.742086 |
| PA | 1586.245369 | 7.718757e-06 | -2362.820726 | 0.795440 | 0.008658 | 0.260857 | 0.028044 | 1.679273e-54 | 0.721062 |
| RI | 150.027281 | 2.193918e-07 | -180.137919 | 0.846271 | 0.250846 | 0.989602 | 0.443156 | 1.463291e-69 | 0.754052 |
| SC | -128.504686 | -5.235778e-06 | 369.583375 | 0.855448 | 0.857295 | 0.776802 | 0.743793 | 9.057806e-70 | 0.735502 |
| SD | 129.654896 | 3.269728e-05 | -275.809030 | 0.896683 | 0.256296 | 0.024519 | 0.135142 | 1.432762e-88 | 0.791715 |
| TN | -2844.158040 | -8.012722e-05 | 6030.236163 | 0.759293 | 0.023791 | 0.002628 | 0.006015 | 4.763556e-51 | 0.660764 |
| TX | -5985.899201 | -4.202048e-05 | 12619.873903 | 0.810367 | 0.304626 | 0.119667 | 0.202982 | 4.729466e-57 | 0.713014 |
| UT | -114.118133 | -3.803875e-06 | 238.835577 | 0.904120 | 0.658562 | 0.724141 | 0.611068 | 2.398833e-88 | 0.818555 |
| VT | 30.024489 | -2.754106e-06 | -28.756783 | 0.662372 | 0.025239 | 0.193841 | 0.175939 | 1.133417e-33 | 0.580452 |
| VA | 252.862361 | -1.026793e-05 | 193.492509 | 0.786773 | 0.654699 | 0.264441 | 0.843092 | 1.645440e-54 | 0.663787 |
| WA | -375.775669 | -1.954180e-05 | 1214.042838 | 0.692400 | 0.396361 | 0.026011 | 0.135505 | 4.595653e-37 | 0.554843 |
| WV | -56.099086 | -1.686960e-05 | 219.221233 | 0.704879 | 0.573674 | 0.035173 | 0.203062 | 3.679057e-40 | 0.550724 |
| WI | 151.116121 | 9.917330e-06 | -367.950380 | 0.902418 | 0.809611 | 0.532312 | 0.745401 | 2.357136e-89 | 0.802143 |
| WY | -43.568136 | -2.181710e-05 | 128.455244 | 0.540952 | 0.371407 | 0.024843 | 0.124055 | 3.273509e-21 | 0.340088 |
print('Average Accuracy: ', df_res3['rSquared'].mean())
print('Best Accuracy: ', df_res3['rSquared'].max())
print('Worst Accuracy: ', df_res3['rSquared'].min())
Average Accuracy: 0.6374506960622269 Best Accuracy: 0.9327888524308207 Worst Accuracy: 0.09507175213166963
This model works really well in some states, and terribly in others. Lets see whether this can be a function of region given that it looked like region had an effect on Ratio earlier.
Below is the State of California's OLS regression result
print(res3_CA.summary())
OLS Regression Results
==============================================================================
Dep. Variable: positiveIncrease R-squared: 0.862
Model: OLS Adj. R-squared: 0.861
Method: Least Squares F-statistic: 537.8
Date: Tue, 15 Dec 2020 Prob (F-statistic): 1.16e-110
Time: 23:28:19 Log-Likelihood: -2224.0
No. Observations: 262 AIC: 4456.
Df Residuals: 258 BIC: 4470.
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -2941.1807 2674.277 -1.100 0.272 -8207.371 2325.010
nTrips -1.997e-05 9.89e-06 -2.019 0.044 -3.95e-05 -4.96e-07
Ratio 7209.6144 4880.127 1.477 0.141 -2400.338 1.68e+04
Yesterday 0.8877 0.029 30.591 0.000 0.831 0.945
==============================================================================
Omnibus: 87.877 Durbin-Watson: 2.367
Prob(Omnibus): 0.000 Jarque-Bera (JB): 549.681
Skew: 1.180 Prob(JB): 4.35e-120
Kurtosis: 9.692 Cond. No. 7.87e+09
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.87e+09. This might indicate that there are
strong multicollinearity or other numerical problems.
#Add region column
df_res3['region'] = df_res3.index.map(regions)
print('West')
print('Average Accuracy: ', df_res3[df_res3['region'] == 'W']['rSquared'].mean())
print('Best Accuracy: ', df_res3[df_res3['region'] == 'W']['rSquared'].max())
print('Worst Accuracy: ', df_res3[df_res3['region'] == 'W']['rSquared'].min())
print('Midwest')
print('Average Accuracy: ', df_res3[df_res3['region'] == 'M']['rSquared'].mean())
print('Best Accuracy: ', df_res3[df_res3['region'] == 'M']['rSquared'].max())
print('Worst Accuracy: ', df_res3[df_res3['region'] == 'M']['rSquared'].min())
print('South')
print('Average Accuracy: ', df_res3[df_res3['region'] == 'S']['rSquared'].mean())
print('Best Accuracy: ', df_res3[df_res3['region'] == 'S']['rSquared'].max())
print('Worst Accuracy: ', df_res3[df_res3['region'] == 'S']['rSquared'].min())
print('North')
print('Average Accuracy: ', df_res3[df_res3['region'] == 'N']['rSquared'].mean())
print('Best Accuracy: ', df_res3[df_res3['region'] == 'N']['rSquared'].max())
print('Worst Accuracy: ', df_res3[df_res3['region'] == 'N']['rSquared'].min())
West Average Accuracy: 0.6667790980223166 Best Accuracy: 0.8632320100125113 Worst Accuracy: 0.34008848425840366 Midwest Average Accuracy: 0.6981314329625259 Best Accuracy: 0.8806608778503784 Worst Accuracy: 0.09507175213166963 South Average Accuracy: 0.6313479044239492 Best Accuracy: 0.9016402817300644 Worst Accuracy: 0.30561148070438615 North Average Accuracy: 0.5573428364398568 Best Accuracy: 0.9327888524308207 Worst Accuracy: 0.16375769985471778
It doesn't look like region has anything to do with accuracy of the model. However, the average high accuracies combined with the small p-values for Yesterday's gains shows that this model captures a lot of the variance for most states.
The datasets we are using have been stripped of personal identifiers and has been publicly shared by the COVID 19 Project and NY Times so privacy concerns did not apply. With that being said, we are cognizant of our data including deaths and ventilator usage, and while not the direct topic of our research we will keep these implications in the forefront of our minds. Our intentions with this question is to identify how interstate travel is correlated with COVID cases to provide policymakers with more information to better inform their decision making, and not put blame on those who travel interstate. The data we found is generated by volunteers collecting publicly shared information by county, hospital, and state and then compiling it into a central database. We expect the data to be generally unbiased as the data collected is nondiscriminatory, however it may be underrepresentative of 'true' figures due to lack of self reporting by individuals, delays in the updating of records, and lack of official government support.
We also understand that interstate travel is the primary source of income for many individuals (i.e. delivery truck drivers) and do not intend to presume that they should be held responsible for the amount of COVID cases found as a result of our research. Additionally, if our research does lead to interstate travel restrictions, we hope that those whose incomes are reliant upon interstate travel are properly supported and that no blame should be placed onto them.
Again, our goal is to help inform future policy changes and raise awareness regarding transmission rates by travel, rather than place fault upon any individual.
The general understanding of how COVID-19 is spread indicates that direct, indirect, and/or close interaction with those who have COVID-19 are the most common forms. Our study concludes that the association between the number of trips and the number of individuals who tested positive for COVID-19 more than likely boiled down to a number of confounding variables that we did not or could not fully investigate. Although we may have found certain correlations between the number of trips and the number of individuals with a positive result for COVID-19, all of these correlations were rather weak when we further analyzed and investigated the data.
Due to the high variability of populations between states, we found that comparing the number of trips between all states resulted in exaggerated numbers of trips for states with higher populations. Therefore, in order to mediate and further analyze the data, we decided to instead investigate the relationship between the population of the state and the number of COVID-19 cases. By doing this we found that the number of people who tested positive for COVID-19 significantly correlated with the population of the state. Therefore, we ultimately decided to examine our data on a state level inorder to avoid any regional and national biases.
For our study we used models such as linear regression to perform various means of data analysis. Within our data analysis the number of trips and the novel cases of COVID-19 were heavily scrutinized. Although, we soon realized that most of our data, specifically between the months of March and April, were more than likely biased as around that time of the year COVID-19 testing was not done appropriately. While we safely assumed that the trips data from the Bureau of Transportation Statistics accurately reproduced the actual travel data, we understood that the daily COVID-19 cases data likely weren’t as accurate due to limited testing availability and various state/local mandates. We believe that this data discrepancy was one of the variables that led to our result to be inconclusive.
Overall, our models predict that there is no direct correlation between the spread of COVID-19 and the number of trips within a state. However, we found that the data we collected is heavily reliant on time. Therefore, after considering yesterday's positive increases, we found that the model gained in accuracy but the significance of total number of trips remains close to zero. This implies that the number of trips has no effect on new positive cases. While we couldn't explicitly test for it, we would make a new hypothesis that travelling with social distancing measures (i.e. facemasks) has no effect on new cases, and may help with future policy guidelines. In the future, as more data begins to arise, we hope to build a model that provides meaningful insights. Additionally, we realized that the ever increasing threat of COVID-19 poses an excessive danger to the international community. Thus, we found that there have been multiple studies conducted to help search for and discover ways in which outbreaks of the virus can be prevented [1]. We believe that that should serve as a foundation of hope for the eventual eradication of COVID-19.
[1] Frisan, T. (2020). Faculty Opinions recommendation of Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV2). Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature. doi:10.3410/f.737557783.793572927.